Abstract: Error-Correcting Output Codes (ECOC) is a successful technique
in multi-class classification, which is a core problem in Pattern
Recognition and Machine Learning. A major advantage of ECOC over
other methods is that the multi-class problem is decoupled into a set of
binary problems that are solved independently. However, the literature
defines a general error-correcting capability for ECOCs without analyzing
how it is distributed among classes, hindering a deeper analysis of
pair-wise error-correction. To address these limitations this paper proposes
an Error-Correcting Factorization (ECF) method. Our contribution is
four-fold: (I) We propose a novel representation of the error-correction
capability, called the design matrix, that enables us to build an ECOC
on the basis of allocating correction to pairs of classes. (II) We derive
the optimal code length of an ECOC using rank properties of the design
matrix. (III) ECF is formulated as a discrete optimization problem, and a
relaxed solution is found using an efficient constrained block coordinate
descent approach. (IV) Enabled by the flexibility introduced with the
design matrix, we propose to allocate the error-correction on classes
that are prone to confusion. Experimental results on several databases
show that when allocating the error-correction to confusable classes
ECF outperforms state-of-the-art approaches.

Fig. 1. Example of a classification problem of 4 different sports
balls. Note how One vs. All or Dense Random ECOC designs
do not take into account the data distribution, while the proposed
Error-Correcting Factorization method finds an ECOC matrix X
by factorizing a design matrix D. In addition, the codes (rows of
X) that ECF assigns to similar categories are very dissimilar in order
to benefit from Error-Correcting principles.


Index Terms: Error-Correcting Output Codes, Multi-class learning,
Matrix Factorization


1 Introduction 

In the last decade datasets have experienced an exponential
growth rate, generating vast collections of data that need to
be analyzed automatically. In particular, multimedia datasets
have experienced an explosion in data availability, thanks
to the almost negligible cost of gathering multimedia data
from the Internet. Therefore, there is a pressing need for efficient
algorithms that are able to automate knowledge extraction
processes on those datasets. One of the classic problems in
Pattern Recognition and Machine Intelligence is to perform
automatic classification, i.e., automatically attributing a label
to each sample of the dataset. In this sense, the classification
process is often considered as a first step for higher order
representations or knowledge extraction. In multi-class classification
problems the goal is to find a function $f : \mathbb{R}^n \to K$
that maps samples to a finite discrete set $K$ of labels with
$|K| > 2$. While there exists a large set of approaches to estimate
$f$, all of them can be grouped in two categories:
Single-Machine/Single-Loss approaches and Divide and Conquer
approaches. The former attempt to approximate a single $f$ for
the complete multi-class problem, while the latter decouple $f$
into a set of binary sub-functions (binary classifiers) that are
potentially easier to estimate, and then aggregate the results.

In this sense, Error-Correcting Output Codes (ECOC) is a
divide and conquer approach that has proven to be very effective
in many different multi-class contexts. The core property
of an ECOC is its capability to correct errors in binary
classifiers by using redundancy. However, existing literature
represents the error-correcting capability of an ECOC as a
scalar, hindering a deeper analysis of error-correction and
redundancy on class pairs. Furthermore, classical divide and
conquer approaches that have been included in the ECOC
framework, like One vs. All [48] or Random [2] approaches,
ignore the data distribution, thus failing to take advantage of
allocating the error-correcting capabilities of ECOCs in a
problem-dependent fashion. In addition, recent problem-dependent
ECOC designs have focused on designing the binary sub-functions
rather than analyzing the core error-correcting property.
In order to overcome these limitations, our proposal builds
an ECOC matrix by factorizing a design matrix D that encodes
the desired 'correction properties' between classes (i.e., a design
matrix which can be obtained directly from data or be set by
experts on the problem domain). The proposed method finds
the ECOC coding that yields the closest configuration to the
design matrix. We cast the task of designing an ECOC as a
matrix factorization problem with binary constraints. A visual
example is shown in Figure 1.

2 Related Work

2.1 Single-machine/Single-loss Approaches 

The multi-class problem can be directly treated by some
methods that exhibit a multi-class behaviour off the shelf (e.g.,
Nearest Neighbours [22], Decision Trees [30], Random Forests
[?]). However, some of the most powerful methods for binary
classification, like Support Vector Machines (SVM) or Adaptive
Boosting (AdaBoost), cannot be directly extended to the
multi-class case and further development is required. In this sense,
the literature is prolific on single-loss strategies to estimate $f$.
Among the best-known approaches are the extensions of
SVMs [?] to the multi-class case. For instance, the work of
Weston and Watkins [55] presents a single-machine extension
of the SVM method to cope with the multi-class case, in
which $k$ predictor functions are trained, constrained with $k - 1$
slack variables per sample. However, a more recent adaptation
[?] reduces the number of constraints per sample to
one, paying only for the second largest classification score
among the $k$ predictors. To solve the optimization problem
a dual decomposition algorithm is derived, which iteratively
solves the quadratic programming problem associated with
each training sample. Despite these efforts, single-machine
approaches to estimate $f$ scale poorly with the number of
classes and are often outperformed by simple decompositions
[48], [52]. In recent years various works that extend the classical
Adaptive Boosting method [20] to the multi-class setting
have been presented [51], [43]. In [62] the authors directly
extend the AdaBoost algorithm to the multi-class case without
reducing it to multiple binary problems, that is, estimating a
single $f$ for the whole multi-class problem. This algorithm is
based on an exponential loss function for multi-class classification
which is optimized on a forward stage-wise additive
model. Furthermore, the work of Saberian and Vasconcelos
[?] presents a derivation of a new margin loss function for
multi-class classification together with the set of real class
codewords that maximize the presented multi-class margin,
yielding boundaries with maximum margin. However, though these
methods are consistently derived and supported with strong
theoretical results, methodologies that jointly optimize a
multi-class loss function present some limitations:

• They scale linearly with k, rendering them unsuitable for 
problems with a large k. 

• Due to their single-loss architecture the exploitation of
parallelization on modern multi-core processors is difficult.

• They cannot recover from classification errors on the class
predictors.

2.2 Divide and Conquer Approaches 

On the other hand, the divide and conquer approach has
drawn a lot of attention due to its excellent results and easily
parallelizable architecture [48], [52], [?], [?], [?], [?], [?],
[28]. Instead of developing a method to cope
with the multi-class case, divide and conquer approaches
decouple $f$ into a set of $l$ binary problems which are treated
separately. Once the responses of the binary classifiers are obtained,
a committee strategy is used to find the final output. In this
trend one can find three main lines of research: flat strategies,
hierarchical classification, and ECOC. Flat strategies like One
vs. One [52] and One vs. All [48] are those that use a predefined
problem partition scheme followed by a committee strategy
to aggregate the binary classifier outputs. On the other hand,
hierarchical classification relies on a similarity metric
among classes to build a binary tree in which nodes correspond
to different problem partitions [23], [40], [?]. Finally, the ECOC
framework consists of two steps. In the coding step, a set of
binary partitions of the original problem is encoded in a
matrix of discrete codewords [16] (unequivocally defined, one code
per class) (see Figure 2). At the decoding step a final decision
is obtained by comparing the test codeword, resulting from the
union of the binary classifier responses, with every class codeword
and choosing the class codeword at minimum distance
[?]. The coding step has been widely studied in the literature,
yielding three different types of codings: predefined codings
[48], [52], random codings [2] and problem-dependent codings
for ECOC [15], [46], [?], [57], [24], [58]. Predefined codings like
One vs. All or One vs. One are directly embeddable in the
ECOC framework. In [2], the authors propose the Dense and
Sparse Random coding designs with a fixed code length of
$\{10, 15\}\log_2(k)$, respectively. In [2] the authors also recommend
generating a set of $10^4$ random matrices and selecting the one that
maximizes the minimum distance between rows, thus showing
the highest correction capability. However, the selection of a
suitable code length $l$ still remains an open problem.

2.3 Problem-dependent Strategies 

Alternatively, problem-dependent strategies for ECOC have
proven to be successful in multi-class classification tasks [57],
[23], [24], [58], [15], [60], [59], [46]. A common trend of these
works is to exploit information of the multi-class data distribution
obtained a priori in order to design a decomposition
into binary problems that are easily separable. In that sense,
[57] computes a spectral decomposition of the graph Laplacian
associated to the multi-class problem. The expected most separable
partitions correspond to the thresholded eigenvectors of
the Laplacian. However, this approach does not provide any
guarantees on defining unequivocal codewords (which is a
core property of the ECOC coding framework) or obtaining
a suitable code length $l$. In [24], Gao and Koller propose a
method which adaptively learns an ECOC coding by optimizing
a novel multi-class hinge loss function sequentially.
In an update of their earlier work, Gao and Koller propose
in [?] a joint optimization process to learn a hierarchy of
classifiers in which each node corresponds to a binary sub-problem
that is optimized to find easily separable subproblems.
Nonetheless, although the hierarchical configuration speeds up
the testing step, it is highly prone to error propagation since
node misclassifications cannot be recovered. Finally, the work
of Zhao et al. [?] proposes a dual projected gradient method
embedded in a constrained concave-convex procedure to optimize
an objective composed of a measure of expected problem
separability, codeword correlation and regularization terms. In
the light of these results, a general trend of recent works is to
optimize a measure of binary problem separability in order to
induce easily separable sub-problems. This assumption leads
to ECOC coding matrices that boost the boundaries of easily
separable classes while modeling with low redundancy the
ones with most confusion.

2.4 Our approach 

In this paper we present the Error-Correcting Factorization
(ECF) method for factorizing a design matrix of desired
'error-correcting properties' between classes into a discrete ECOC
matrix. The proposed ECF method is a general framework for
the ECOC coding step since the design matrix is a flexible
tool for error-correction analysis. In this sense, the problem of
designing the ECOC matrix is reduced to defining the design
matrix, where higher level reasoning may be used. For example,
following recent state-of-the-art works one could build a
design matrix following a "hard classes are left behind" spirit,
boosting the boundaries of easily separable classes and disregarding
the classes that are not easily separable. An alternative
for building the design matrix is the "no class is left behind"
criterion, where we may boost those classes that are prone to be
confused in the hope of recovering more errors. Note that the
design matrix could also directly encode knowledge of domain
experts on the problem, providing great flexibility in the
design of the ECOC coding matrix. Figure 2 shows different
coding schemes and the real boundaries learned by binary
classifiers (SVM with RBF kernel) for a Toy problem of 14
classes (see Section 5 for further details on the dataset). We can
see how the binary problems induced by ECF in Fig. 2(a) boost
the boundaries of classes that are prone to be confused, while
other approaches that use an equal or higher number of classifiers,
like Dense Random [2] in Fig. 2(b) or classic One vs. All
designs in Fig. 2(c), fail in this task. The paper is organized as
follows: Section 3 introduces the ECOC properties and derives
ECF, where we cast the problem of finding an ECOC matrix
that follows a certain distribution of correction as a discrete
optimization problem. Section 4 presents a discussion of the
method addressing important issues from the point of view
of the ECOC framework. Concretely, we derive the optimal
problem-dependent code length for ECOCs obtained by means
of ECF, which to the best of our knowledge is the first time
this question is tackled in the extensive ECOC literature. In
addition, we show how ECF converges to a solution with
negligible objective value when the design matrix follows
certain constraints. Section 5 shows how ECF yields ECOC
coding matrices that obtain higher classification performance
than state-of-the-art methods with comparable or lower
computational complexity. Finally, Section 6 concludes the paper.

Fig. 2. (a) SVM RBF boundaries learned from Error-Correcting Factorization along with the ECOC coding matrix X in a Toy
problem, 77.12% classification accuracy (12 classifiers are trained). (b) Boundaries learned by the Dense Random ECOC coding
design, 66.45% classification accuracy (12 classifiers are trained). (c) SVM boundaries induced by the One vs. All approach, 49.53%
classification accuracy (14 classifiers are trained).

3 Methodology 

In this section, we review existing properties of the ECOC 
framework and propose to cast the ECOC coding matrix opti¬ 
mization as a Matrix Factorization problem that can be solved 
efficiently using a constrained coordinate descent approach. 


3.1 Error-Correcting Output Codes 

ECOC is a multi-class framework inspired by the error-correcting
principles of communication theory [16], which is
composed of two different steps: coding [16], [?] and decoding
[?], [?]. At the coding step an ECOC coding matrix
$X \in \{-1,+1\}^{k \times l}$ (see notation in footnote 1) is constructed,
where $k$ denotes the number of classes in the problem and $l$ the number of
bi-partitions (also known as dichotomies) to be learnt. In the
coding matrix, the rows ($x^i$'s, also known as codewords) are
unequivocally defined, since these are the identifiers of each
category in the multi-class problem. On the other hand, the
columns of $X$ ($x_j$'s) denote the bi-partitions to be learnt by
base classifiers (also known as dichotomizers). Therefore, for a
certain column a dichotomizer learns the boundary between
classes valued $+1$ and classes valued $-1$. However, [2] introduced
a third value, defining ternary valued coding matrices
$X \in \{-1,+1,0\}^{k \times l}$. In this case, for any given dichotomy
categories can be valued as $+1$ or $-1$ depending on the
meta-class they belong to, or $0$ if they are ignored by the
dichotomizer. This new value allows the inclusion of well-known
decomposition techniques into the ECOC framework,
such as One vs. One [52].

At the decoding step a data sample $s$ is classified among
the $\{c_1, \dots, c_k\}$ possible categories. In order to perform the
classification task, each dichotomizer predicts a binary value
for $s$ indicating which of the two bi-partitions defined
by the corresponding dichotomy it belongs to. Once the set of predictions
$y \in \{-1,+1\}^l$ is obtained, it is compared to the rows of $X$
using a distance function $\delta$, known as the decoding function.
Usual decoding techniques are based on well-known distance
measures such as the $\ell_1$ or Euclidean distance. These measures
have proven to be effective for $X \in \{+1,-1\}^{k \times l}$. Nevertheless,
it was not until the work of [17] that decoding functions took
into account the meaning of the $0$ value at the decoding step.
Generally, the final prediction for $s$ is given by the class $c_i$,
where $i = \arg\min_{i} \delta(x^i, y)$, $i \in \{1, \dots, k\}$.
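The decoding step can be illustrated with a minimal sketch, assuming a binary coding matrix and Hamming decoding. The matrix `X`, the prediction vector `y` and the function names are hypothetical and chosen only for illustration; they are not taken from the experiments.

```python
def hamming(u, v):
    """Number of positions where two equal-length codewords differ."""
    return sum(a != b for a, b in zip(u, v))


def decode(X, y):
    """Return the index of the row of X at minimum Hamming distance from y."""
    distances = [hamming(row, y) for row in X]
    return min(range(len(X)), key=distances.__getitem__)


# Hypothetical coding matrix: k = 4 classes, l = 5 dichotomizers.
X = [
    [+1, +1, +1, +1, +1],
    [-1, -1, +1, +1, -1],
    [+1, -1, -1, -1, +1],
    [-1, +1, -1, +1, -1],
]

# Codeword of class index 1 with its last bit flipped by one classifier error.
y = [-1, -1, +1, +1, +1]
predicted = decode(X, y)
```

In this toy case the corrupted codeword is still closest to row 1, so the single binary error does not change the multi-class decision.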

1. Bold capital letters denote matrices (e.g. X), bold lower-case letters
represent vectors (e.g., x). All non-bold letters denote scalar variables.
$x^i$ is the $i$-th row of the matrix X. $x_j$ is the $j$-th column of the matrix
X. $\mathbf{1}$ is a matrix or vector of all ones of the appropriate size. $x_{ij}$ denotes
the scalar in the $i$-th row and $j$-th column of X. $\|X\|_F^2 = \mathrm{tr}(X^\top X)$
denotes the (squared) Frobenius norm. $\|\cdot\|_p$ is used to denote the $L_p$-norm. $x \oplus y$
is an operator which concatenates vectors x and y. rank(X) denotes
the rank of X. $X \le 0$ denotes the point-wise inequality.

3.2 Good practices in ECOC

Several works have studied the characteristics of a good ECOC
coding matrix [16], [36], [3], [57], [?], which are summed up in
the following three properties:

1) Correction capability: let $H \in \mathbb{R}^{k \times k}$ denote a symmetric
matrix of Hamming distances among all pairs of rows in
$X$; the correction capability is expressed as $\lfloor \frac{\min(H) - 1}{2} \rfloor$,
considering only off-diagonal values of $H$. In this sense,
if $\min(H) = 3$, ECOC will be able to recover the correct
multi-class prediction even if $\lfloor \frac{3 - 1}{2} \rfloor = 1$ binary classifier
misses its prediction (see footnote 2).

2) Uncorrelated binary sub-problems: the induced binary 
problems should be as uncorrelated as possible for X to 
recover binary classifier errors. 

3) Use of powerful binary classifiers: since the final class 
prediction consists of the aggregation of bit predictors, 
accurate binary classifiers are also required to obtain 
accurate multi-class predictions. 
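As a small illustration of the first property, the sketch below computes the Hamming matrix H and the global correction capability for a hypothetical coding matrix. The matrix and the helper names are assumptions made for this example only.

```python
def hamming_matrix(X):
    """Symmetric matrix of Hamming distances between all pairs of rows of X."""
    return [[sum(a != b for a, b in zip(r, s)) for s in X] for r in X]


def global_correction(X):
    """floor((min(H) - 1) / 2), using only the off-diagonal values of H."""
    H = hamming_matrix(X)
    k = len(X)
    min_off = min(H[i][j] for i in range(k) for j in range(k) if i != j)
    return (min_off - 1) // 2


# Hypothetical coding matrix: k = 4 classes, l = 5 dichotomizers.
X = [
    [+1, +1, +1, +1, +1],
    [-1, -1, +1, +1, -1],
    [+1, -1, -1, -1, +1],
    [-1, +1, -1, +1, -1],
]

H = hamming_matrix(X)
capability = global_correction(X)
```

Here the closest pair of codewords (rows 1 and 3) differs in only two positions, so the global capability is zero even though other pairs are further apart.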

3.3 From global to pair-wise correction capability 

In the literature, correction capability has been a core objective
of problem-dependent designs of $X$. In this sense, different
authors have always agreed on defining correction capability
for an ECOC coding matrix as a global value [16], [?], [36],
[57], [23], [25]. Hence, $\min(H)$ is expected to be large in order
for $X$ to recover from as many binary classifier errors as
possible. However, since $H$ expresses the Hamming distance
between rows of $X$, one can alternatively express the correction
capability in a pair-wise fashion [?], allowing for a deeper
understanding of how correction is distributed among codewords.
Figure 3 shows an example of global and pair-wise
correction capabilities calculation. Recall that the $\oplus$ operator
between two vectors denotes their concatenation. Thus, the
pair-wise correction capability is defined as follows:

4) The pair-wise correction capability of codewords $x^i$
and $x^j$ is expressed as $\lfloor \frac{\min(h^i \oplus h^j) - 1}{2} \rfloor$, where we only
consider off-diagonal values of $H$. This means that a
sample of class $c_i$ is correctly discriminated from class
$c_j$ even if $\lfloor \frac{\min(h^i \oplus h^j) - 1}{2} \rfloor$ binary classifiers miss their
predictions.
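The pair-wise definition can be sketched as follows, assuming that $h^i$ denotes the $i$-th row of H and that the diagonal entries of both rows are excluded before taking the minimum. The coding matrix is hypothetical.

```python
def hamming_matrix(X):
    """Symmetric matrix of Hamming distances between all pairs of rows of X."""
    return [[sum(a != b for a, b in zip(r, s)) for s in X] for r in X]


def pairwise_correction(H, i, j):
    """floor((min(h^i concatenated with h^j) - 1) / 2), off-diagonal values only."""
    values = [h for c, h in enumerate(H[i]) if c != i]
    values += [h for c, h in enumerate(H[j]) if c != j]
    return (min(values) - 1) // 2


# Hypothetical coding matrix: k = 4 classes, l = 5 dichotomizers.
X = [
    [+1, +1, +1, +1, +1],
    [-1, -1, +1, +1, -1],
    [+1, -1, -1, -1, +1],
    [-1, +1, -1, +1, -1],
]
H = hamming_matrix(X)

far_pair = pairwise_correction(H, 0, 2)    # rows 0 and 2 are well separated
close_pair = pairwise_correction(H, 1, 3)  # rows 1 and 3 are the closest pair
```

For this matrix the pair (0, 2) tolerates one binary error while the pair (1, 3) tolerates none, which a single global value would hide.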

Note that though in Figure 3 the global correction capability
of $X$ is 0, there are pairs of codewords with a higher correction,
e.g. $x^2$ and $x^8$. In this case the global correction capability
as defined in the literature overlooks ECOC coding characteristics
that can potentially be exploited. This novel way of
expressing the correction capability of an ECOC matrix enables
a better understanding of how ECOC coding matrices distribute
their correction capability, and gives an insight on how
to design coding matrices. In this sense, it is straightforward
to demand that the correction capabilities of the ECOC matrix
be allocated to those classes that are more prone to
error, in order for them to have better recovery behavior (i.e.
following a "no class is left behind" criterion). However, recent
works [?], [?], [?] have focused on designing a matrix $X$
where binary problems are easily separable. This assumption
leads to a matrix $X$ where classes that are not easily separable

2. In the case of ternary codes this correction capability can be easily
adapted.

3. Note that for X to be valid all off-diagonal elements of H should
be greater than or equal to one.


Fig. 3. Example of global versus pair-wise correction capability
for a coding matrix X and its distance matrix H.
On the left side of the figure the calculation of the global
correction capability is shown. The right side of the image shows
a sample of pair-wise correction calculation for codewords $x^2$
and $x^8$.

show a small Hamming distance on their respective codewords
(i.e. following a "hard classes are left behind" scheme).

In addition to the proposal of a general method for ECOC
coding by means of the definition of a design matrix, we
explore the effect of focusing the learning effort of our method
on those classes that have complex boundaries (i.e. those which
show a small inter-class margin). It is important to take into
account that though it is natural to estimate the design matrix
from training data, this is not a limitation of ECF. In this sense,
the design matrix can also encode information from experts or any
other distance measure directly set by the user. Formally, let
$X \in \{-1,+1\}^{k \times l}$ be a coding matrix, let $H$ be a symmetric
matrix of pair-wise $\ell_1$ distances between rows of $X$ and let
$D \in \mathbb{R}^{k \times k}$ be a design matrix (e.g. pair-wise distance measure
between class codewords). It is natural to see that the ordinal
properties of the distance should hold in $H$ and $D$. Thus, if the
distance between codewords $x^k$ and $x^l$ ($d_{kl}$) is required to be
larger than the distance between codewords $x^i$ and $x^j$ ($d_{ij}$),
this order should be maintained in $H$. Then we want to find a
configuration of $X$ such that $h_{ij} \le h_{kl} \iff d_{ij} \le d_{kl} \;\forall i, j, k, l$.

Note that the $\ell_1$ distances in $H$ can be seen as a function
of the dot product of the codewords: $\|x^i - x^j\|_1 = l - x^i x^{j\top}$
(twice the Hamming distance $\frac{l - x^i x^{j\top}}{2}$),
where $x \in \{-1,+1\}^l$. Therefore, instead of directly requiring
$H$ to match $D$, we can equivalently require the product $XX^\top$
to match $D$ [54]. This implies that we can cast the problem of
finding $X$ into a Matrix Factorization problem, where we find
an $X$ so that the matrix of inner products $XX^\top$ is closest to $D$
under a given norm.
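The link between distances and inner products can be checked numerically. The sketch below uses two hypothetical codewords in $\{-1,+1\}^l$ and verifies that the $\ell_1$ distance equals $l$ minus the dot product, and that the Hamming distance is half of that quantity.

```python
def dot(u, v):
    """Inner product of two codewords."""
    return sum(a * b for a, b in zip(u, v))


def l1_distance(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def hamming(u, v):
    """Number of positions where the codewords differ."""
    return sum(a != b for a, b in zip(u, v))


# Two hypothetical codewords of length l = 6.
xi = [+1, -1, +1, +1, -1, -1]
xj = [+1, +1, -1, +1, -1, +1]
l = len(xi)

inner = dot(xi, xj)
l1 = l1_distance(xi, xj)
ham = hamming(xi, xj)
```

Because both distances are affine functions of the inner product, matching the Gram matrix $XX^\top$ to a target is equivalent to matching the pair-wise distances.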


3.4 Error-Correcting Factorization 

This section describes the objective function and the optimiza¬ 
tion strategy for the ECF algorithm. 

3.4.1 Objective 

Our goal is to find an ECOC coding matrix that encodes the
properties denoted by the design matrix $D$. In this sense, ECF
seeks a factorization of the design matrix $D \in \mathbb{R}^{k \times k}$ into a
discrete ECOC matrix $X$. This factorization is formulated as
the quadratic form $XX^\top$ that reconstructs $D$ with minimal
Frobenius distance under several constraints, as shown in
Equation (1).


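\begin{align}
\underset{X}{\text{minimize}} \quad & \|D - XX^\top\|_F^2 \tag{1} \\
\text{subject to} \quad & X \in \{-1,+1\}^{k \times l} \tag{2} \\
& XX^\top - P \le 0 \tag{3} \\
& X^\top X - \mathbf{1}(l-1) \le 0 \tag{4} \\
& -X^\top X - \mathbf{1}(l-1) \le 0 \tag{5}
\end{align}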


The component $X^* \in \{-1,+1\}^{k \times l}$ that solves this optimization
problem generates the inner product of discrete vectors
that is closest to $D$ under the Frobenius norm. In order for $X$
to be a valid matrix under the ECOC framework we constrain
$X$ in Equations (2)-(5). Equation (2) ensures that, in each binary
problem, classes belong to one of the two possible meta-classes.
In addition, to avoid the case of having two or more
equivalent rows in $X$, the constraint in Equation (3) ensures that the
correlation between rows of $X$ is less than or equal to a certain
user-defined matrix $-\mathbf{1}l \le P \le \mathbf{1}l$ (recall that $\mathbf{1}$ denotes a
matrix or vector of all ones of the appropriate size),
where $P$ encodes the minimum distance between any pair of
codewords. $P$ is a symmetric matrix with $p_{ii} = l \;\forall i$. Thus,
by setting the off-diagonal values in $P$ we can control the
minimum inter-class correction capability. Hence, if we want
the correction capability of rows x z and x j to be L^r 1 ]/ we set 

p* = p 3 = i o -<=)• 

Finally, the constraints in Equations (4) and (5) ensure that the
induced binary problems are not equivalent. Similar constraints
have been studied thoroughly in the literature [16], [36], [25],
defining methods that rely on diversity measures for binary
problems to obtain a coding matrix $X$. Equations (4) and (5)
can be considered as soft constraints since their violation does not
imply violating the ECOC properties in terms of row distance.
This is easy to show since a coding matrix $X \in \{-1,+1\}^{k \times l}$
that induces some equivalent binary problems but ensures that
$XX^\top \le \mathbf{1}(l-1)$ will define a matrix whose rows
are unequivocally defined. In this sense, a coding matrix $X$ can
be easily projected on the set defined by constraints (4) and (5)
by eliminating repeated columns, $X = \{x_j : x_j \ne x_i \;\forall j \ne i\}$.
Thus, the constraints in (4) and (5) ensure that uncorrelated binary
sub-problems will be defined in our coding matrix $X$. The
discrete constraint in Equation (2) on the variable elevates
the optimization problem to the NP-Hard class. To overcome
this issue, and following [?], [58], [?], we relax the discrete
constraint in (2) and replace it by $X \in [-1,+1]^{k \times l}$ (see Equation (10)).
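To make the objective and the constraints concrete, the following sketch evaluates the Frobenius objective, checks the row-correlation bound against a matrix P, and applies the projection that removes repeated columns. The matrices and function names are hypothetical, and only identical columns are removed (complementary columns are not handled in this sketch).

```python
def gram(X):
    """Matrix of inner products between rows of X, i.e. X X^T."""
    return [[sum(a * b for a, b in zip(r, s)) for s in X] for r in X]


def objective(D, X):
    """Squared Frobenius distance between D and X X^T."""
    G = gram(X)
    return sum((d - g) ** 2 for dr, gr in zip(D, G) for d, g in zip(dr, gr))


def rows_within_bound(X, P):
    """Point-wise check of X X^T - P <= 0."""
    G = gram(X)
    return all(g <= p for gr, pr in zip(G, P) for g, p in zip(gr, pr))


def drop_duplicate_columns(X):
    """Keep the first occurrence of every distinct column of X."""
    seen, keep = set(), []
    for j, col in enumerate(zip(*X)):
        if col not in seen:
            seen.add(col)
            keep.append(j)
    return [[row[j] for j in keep] for row in X]


# Hypothetical coding matrix (k = 3, l = 4) whose last column repeats the first.
X = [
    [+1, +1, -1, +1],
    [-1, +1, +1, -1],
    [+1, -1, +1, +1],
]
# Hypothetical bound: diagonal l = 4, off-diagonal correlation at most 2.
P = [[4, 2, 2], [2, 4, 2], [2, 2, 4]]

D = gram(X)                      # a design matrix that X reproduces exactly
value = objective(D, X)
feasible = rows_within_bound(X, P)
X_projected = drop_duplicate_columns(X)
```

Removing the repeated column shortens the code to three dichotomies while leaving the three codewords pair-wise distinct.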

3.4.2 Optimization 

In this section, we detail the process for optimizing $X$. The
minimization problem posed in Equation (1) with the relaxation
of the boolean constraint in Equation (2) is non-convex; thus,
$X^*$ is not guaranteed to be a global minimum. In this sense,
although gradient descent techniques have been successfully
applied in the literature to obtain local minima [49], [35],
[?], these techniques do not enjoy the efficiency and scalability
properties present in other optimization methods applied to
Matrix Factorization problems, such as Coordinate Descent
[?], [?]. Coordinate Descent techniques have been widely
applied in Nonnegative Matrix Factorization, obtaining satisfactory
results in terms of efficiency [34], [31]. In addition, it has
4. Recall that the $\ell_1$ distance is a function of the dot product:
$\|x^i - x^j\|_1 = l - x^i x^{j\top}$, which is twice the Hamming distance
$\frac{-(x^i x^{j\top}) + l}{2}$.

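been proved that if each of the coordinate sub-problems can
be solved exactly, Coordinate Descent converges to a stationary
point [29], [55]. Using this result, we decouple the problem in
Equation (1) into a set of linear least-squares problems (one
for each coordinate). Therefore, if the problem in Equation
(1) is going to be minimized along the $i$-th coordinate of
$X$, we fix all rows of $X$ except $x^i$ and we substitute $X$
with $\begin{bmatrix} x^i \\ X'^{i} \end{bmatrix}$ in Equations (1) and (3), where $X'^{i}$ denotes matrix
$X$ after removing the $i$-th row. In addition, we substitute $D$
with $\begin{bmatrix} d_{ii} & d^{i} \\ d^{i\top} & D'^{i} \end{bmatrix}$, where $D'^{i}$ denotes the matrix $D$ after
removing the $i$-th row and column and $d^{i}$ denotes the $i$-th row of $D$
without its $i$-th element. Equivalently, we substitute
$P = \begin{bmatrix} p_{ii} & p^{i} \\ p^{i\top} & P'^{i} \end{bmatrix}$, obtaining the following block decomposition:

\begin{align}
\underset{x^i}{\text{minimize}} \quad & \left\| \begin{bmatrix} d_{ii} & d^{i} \\ d^{i\top} & D'^{i} \end{bmatrix} - \begin{bmatrix} x^i x^{i\top} & x^i X'^{i\top} \\ X'^{i} x^{i\top} & X'^{i} X'^{i\top} \end{bmatrix} \right\|_F^2 \tag{6} \\
\text{subject to} \quad & x^i \in [-1,+1]^{l} \tag{7} \\
& \begin{bmatrix} x^i x^{i\top} & x^i X'^{i\top} \\ X'^{i} x^{i\top} & X'^{i} X'^{i\top} \end{bmatrix} - \begin{bmatrix} p_{ii} & p^{i} \\ p^{i\top} & P'^{i} \end{bmatrix} \le 0 \tag{8}
\end{align}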


Analyzing the block decomposition in Equation (6) we can
see that the only terms involving free variables are $x^i x^{i\top}$,
$x^i X'^{i\top}$ and $X'^{i} x^{i\top}$. Thus, since $D$ and $XX^\top$ are symmetric by
definition, the minimizer $x^{i*}$ of Equation (6) is the solution to
the linear least-squares problem shown in Equation (9):

\begin{align}
\underset{x^i}{\text{minimize}} \quad & \left\| X'^{i} x^{i\top} - d^{i\top} \right\|_2^2 \tag{9} \\
\text{subject to} \quad & -1 \le x^i \le +1 \tag{10} \\
& X'^{i} x^{i\top} - p^{i\top} \le 0 \tag{11}
\end{align}

where constraint (10) is the relaxation of the discrete
constraint (2). In addition, constraint (11) ensures that the correlation
of $x^i$ with the rest of the rows of $X$ is below a certain value
$p^i$. Algorithm 1 shows the complete optimization process.


Algorithm 1: Error-Correcting Factorization Algorithm.

Data: $D \in \mathbb{R}^{k \times k}$, $P \in \mathbb{N}^{k \times k}$, $l$
Result: $X \in \{-1,+1\}^{k \times l}$

begin
    repeat
        foreach $i \in \{1, 2, \dots, k\}$ do
            $x^i \leftarrow$ minimize $\|X'^{i} x^{i\top} - d^{i\top}\|_2^2$, subject to:
                $-1 \le x^i \le +1$, $X'^{i} x^{i\top} - p^{i\top} \le 0$;
        $X \leftarrow \epsilon$-suboptimal($X$);
        $X = \{x_j : x_j \ne x_i \;\forall j \ne i\}$;  // Projection step to remove duplicate columns
    until convergence;
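The row-wise updates of Algorithm 1 can be sketched in simplified form. The sketch below is an illustration under explicit assumptions and not the solver used in the paper: each row sub-problem is solved approximately by projected gradient descent instead of an Active Set method, the correlation constraint (11) is omitted, and the final discretization is a plain sign threshold. The design matrix and all function names are hypothetical.

```python
import random


def update_row(X, D, i, steps=200):
    """Approximately minimize ||A x - d||^2 subject to -1 <= x <= 1.

    A is X without row i and d is row i of D without its diagonal entry.
    Assumption: projected gradient replaces the Active Set solver, and the
    correlation constraint (11) is not enforced.
    """
    A = [row for r, row in enumerate(X) if r != i]
    d = [D[i][r] for r in range(len(X)) if r != i]
    x = X[i][:]
    # Upper bound on the Lipschitz constant of the gradient: 2 * ||A||_F^2.
    lip = 2.0 * sum(a * a for row in A for a in row)
    if lip == 0.0:
        return x
    for _ in range(steps):
        resid = [sum(a * b for a, b in zip(row, x)) - t for row, t in zip(A, d)]
        grad = [2.0 * sum(row[c] * r for row, r in zip(A, resid))
                for c in range(len(x))]
        # Gradient step followed by projection onto the box [-1, +1].
        x = [max(-1.0, min(1.0, xc - g / lip)) for xc, g in zip(x, grad)]
    return x


def ecf_relaxed(D, l, sweeps=20, seed=0):
    """Cyclic block coordinate descent over the rows of a relaxed X."""
    rng = random.Random(seed)
    k = len(D)
    X = [[rng.uniform(-1.0, 1.0) for _ in range(l)] for _ in range(k)]
    for _ in range(sweeps):
        for i in range(k):
            X[i] = update_row(X, D, i)
    return X


# Hypothetical design matrix for k = 3 classes and code length l = 4.
D = [[4.0, 0.0, -2.0], [0.0, 4.0, 0.0], [-2.0, 0.0, 4.0]]
X_relaxed = ecf_relaxed(D, 4)
codes = [[1 if v >= 0 else -1 for v in row] for row in X_relaxed]
```

The step size $1/(2\|A\|_F^2)$ is a conservative choice that keeps each row sub-problem from increasing; a constrained least-squares solver would be needed to also respect constraint (11).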


To solve the minimization problem in Algorithm 1 we use
the Active Set method described in [26], which finds an initial
feasible solution by first solving a linear programming
problem. Once ECF converges to a solution $X^*$ with objective
value $f_{obj}(X^*)$ we obtain a discretized $\epsilon$-suboptimal solution
$\tilde{X} \in \{-1,+1\}$ with objective value $f_{obj}(\tilde{X})$ by sampling
1000 points that split the interval $[-1,+1]$ and choosing the
point that minimizes $\|f_{obj}(X^*) - f_{obj}(\tilde{X})\|_2$. Finally, we discard
repeated columns if any appear (see footnote 5).
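One possible reading of this discretization step is sketched below: each of the sampled points is treated as a threshold that splits the relaxed entries into $-1$ and $+1$, and the threshold whose discrete matrix has the objective value closest to the relaxed one is kept. This interpretation, the example matrices and the function names are assumptions made for illustration.

```python
def gram(X):
    """Matrix of inner products between rows of X, i.e. X X^T."""
    return [[sum(a * b for a, b in zip(r, s)) for s in X] for r in X]


def objective(D, X):
    """Squared Frobenius distance between D and X X^T."""
    G = gram(X)
    return sum((d - g) ** 2 for dr, gr in zip(D, G) for d, g in zip(dr, gr))


def discretize(D, X_relaxed, samples=1000):
    """Threshold X_relaxed at evenly spaced points of [-1, +1] (assumed scheme)."""
    target = objective(D, X_relaxed)
    best, best_gap = None, float("inf")
    for s in range(samples):
        t = -1.0 + 2.0 * s / (samples - 1)
        Xd = [[1 if v > t else -1 for v in row] for row in X_relaxed]
        gap = abs(objective(D, Xd) - target)
        if gap < best_gap:
            best, best_gap = Xd, gap
    return best


# Hypothetical relaxed solution and design matrix (k = 2, l = 2).
D = [[2.0, -2.0], [-2.0, 2.0]]
X_star = [[0.9, -0.2], [-0.7, 0.4]]
X_tilde = discretize(D, X_star)

# A solution that is already discrete is returned unchanged.
X_same = discretize(D, [[1.0, -1.0], [-1.0, 1.0]])
```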

3.5 Connections to Singular Value Decomposition, Near¬ 
est Correlation Matrix and Discrete Basis problems 

Similar objective functions to the one defined in the ECF
problem in Equation (1) are found in other contexts, for example,
in the Singular Value Decomposition problem (SVD).
The SVD uses the same objective function as ECF subject to
the constraint $XX^\top = I$. However, the solution of SVD yields
an orthogonal basis, disagreeing with the objective defined in
Equation (1), which ensures different correlations between the
$x^i$'s. In addition, we can also find common ground with
the Nearest Correlation Matrix (NCM) Problem [32], [9], [39].
However, the NCM solution does not yield a discrete factor $X$;
instead it seeks directly the Gramian $XX^\top$ where $X$ is not
discrete, as in Equation (12).

\begin{align}
\underset{X}{\text{minimize}} \quad & \|X - D\|_F^2 \tag{12} \\
\text{subject to} \quad & X \succeq 0 \tag{13} \\
& cXc^\top = b \tag{14}
\end{align}

In addition, ECF has similarities with the Discrete Basis
Problem (DBP) [42], since the factors $X$ are discrete valued.
Nevertheless, DBP factorizes $D \in \{0,1\}^{k \times k}$ instead of
$D \in \mathbb{R}^{k \times k}$, as shown in Equation (15).

\begin{align}
\underset{X, Y}{\text{minimize}} \quad & \|X \circ Y - D\|_1 \tag{15} \\
\text{subject to} \quad & X, Y, D \in \{0,1\} \tag{16}
\end{align}
4 Discussion 

In this section we discuss how to ensure that the design matrix 
D is valid, as well as how to automatically estimate the code 
length for each problem given D. Furthermore, we analyze the 
convergence of ECF in relation to the order of updating the 
coordinates. Finally, we show that under certain conditions on
D, ECF converges to a solution with almost negligible objective
value.

4.1 Ensuring a representable design matrix 

An alternative interpretation for ECF is that it seeks a
discrete matrix $X$ whose Gramian is closest to $D$ under the
Frobenius norm. However, since $D$ can be directly set by the
user we need to guarantee that $D$ is a correlation matrix that
is realizable in the $\mathbb{R}^{k \times k}$ space, that is, $D$ has to be symmetric
and positive semi-definite. In particular, we would like to find
the correlation matrix $\tilde{D} \in \mathbb{R}^{k \times k}$ that is closest to $D$ under
the Frobenius norm. This problem has been treated in several
works [?], [?], [?], [?], resulting in various algorithms that
often use an alternating projections approach. However, for this
particular case, in addition to being in the Positive Semidefinite
(PSD) cone and symmetric, we also require $\tilde{D}$ to be scaled
in the $[-l, +l]$ range, with $\tilde{d}_{ii} = l \;\forall i$. In this sense, to find
$\tilde{D}$ we follow an alternating projections algorithm, similar to
[?], which is shown in Algorithm 2. We first project $D$ into
the PSD cone by computing its eigenvectors and recovering
$\tilde{D} = V \mathrm{diag}(\lambda_+) V^\top$, where $\lambda_+$ are the non-negative
eigenvalues of $D$. Then, we scale $\tilde{D}$ in the range $[-l, +l]$ and set
$\tilde{d}_{ii} = l \;\forall i$.

5. In all our runs of ECF this situation happened with a chance of
less than 10^{-5}%.


Algorithm 2: Projecting D into the PSD cone with additional
constraints.

Data: D ∈ R^{k×k}
Result: \tilde{D} ∈ R^{k×k}
begin
    repeat
        \tilde{D} ← V diag(Λ_+) V^T;
        \tilde{D} ← \tilde{D} ∈ [−l, +l]^{k×k};
        \tilde{D}_ii ← l, ∀i;
    until convergence;
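
A minimal sketch of Algorithm 2 in Python with NumPy is given below. It assumes a plain alternating-projections loop with a simple stopping rule. The function name `project_design_matrix`, the tolerance `tol`, the iteration cap `max_iter` and the initial symmetrization are illustrative assumptions that do not come from the paper.

```python
import numpy as np

def project_design_matrix(D, l, max_iter=500, tol=1e-10):
    """Alternate between the PSD cone, the box [-l, +l] and a fixed diagonal l.

    Hypothetical helper: names and stopping rule are assumptions.
    """
    D_t = (np.asarray(D, dtype=float) + np.asarray(D, dtype=float).T) / 2.0
    for _ in range(max_iter):
        previous = D_t.copy()
        # 1) PSD projection: keep only the non-negative eigenvalues.
        eigvals, eigvecs = np.linalg.eigh(D_t)
        D_t = (eigvecs * np.clip(eigvals, 0.0, None)) @ eigvecs.T
        # 2) Box step: force every entry into [-l, +l].
        D_t = np.clip(D_t, -l, l)
        # 3) Diagonal step: every class has self-correlation l.
        np.fill_diagonal(D_t, l)
        if np.linalg.norm(D_t - previous) < tol:
            break
    return D_t

# Usage: a matrix that is already a Gramian of +-1 codewords of length 3.
X = np.array([[1, 1, -1], [1, -1, 1], [-1, 1, 1], [-1, -1, -1]], dtype=float)
D_valid = X @ X.T
D_projected = project_design_matrix(D_valid, l=3)
```

A matrix that is already the Gramian of ±1 codewords of length l is left unchanged by these steps, since it is PSD, bounded by l in absolute value and has l on its diagonal.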


4.2 Defining a code length with representation guarantees 

The definition of a problem-dependent ECOC code length l,
that is, choosing the number of binary partitions for a given
multi-class task, is a problem that has been overlooked in the
literature. For example, predefined coding designs like One vs.
All or One vs. One have a fixed code length. On the other hand,
coding designs like Dense or Sparse Random codings (which
are very often used in experimental comparisons [57], [58], [4],
(HI)) are suggested [2] to have a code length of ⌈10 log2(k)⌉
and ⌈15 log2(k)⌉, respectively. These values are arbitrary and
unjustified. Additionally, to build a Dense or Sparse Random
ECOC matrix one has to generate a set of 1000 matrices and
choose the one that maximizes min(H). Consider the Dense
Random coding design of length l = ⌈10 log2(k)⌉: the ECOC
matrix will have, in the best case, a correction capability of
4, independently of the distribution of the multi-class
data. In addition, maximizing min(H) leads to an
equi-distribution of the correction capability over the classes.
Other approaches, like Spectral ECOC [57], search for the code
length by looking at the best performance on a validation set.
Nevertheless, recent works have shown that the code length
can be reduced to l = log2(k) with very small loss in
performance if the ECOC coding design is carefully chosen
m and classifiers are strong. In this paper, instead of fixing
the code length or optimizing it on a validation subset, we
derive the optimal length according to matrix rank properties.
Considering the rank of a factorization of D into XX^T, there are
three different possibilities:

1) If rank(XX^T) = rank(D), we obtain a rank factorization
algorithm that should be able to factorize D with minimal
error.

2) If rank(XX^T) < rank(D), we obtain a
low-rank factorization method that cannot guarantee to
represent D with zero error, but reconstructs the components
of D that carry the most information.

3) If rank(XX^T) > rank(D), the system is overdetermined
and many possible solutions exist.

In general we would like to reconstruct D with minimal
error, and since rank(X) ≤ min(k, l) and k (the number of
classes) is fixed, we only have to set the number of columns
of X to control the rank. Hence, by setting rank(X) = l =
rank(D), ECF will be able to factorize D with minimal error.
Figure 4 shows visual results for the ECF method applied to
the Traffic and ARFace datasets. Note how, for the Traffic (36
classes) and ARFace (50 classes) datasets, the code length
required by ECF for a full-rank factorization is l = 6 and l = 8,
respectively, as shown in Figures 4(e)(f).
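
The rank criterion can be illustrated with a few lines of NumPy. This is only a sketch under stated assumptions: the helper name `code_length_from_design` is hypothetical, and the design matrix used here is a synthetic Gramian of ±1 codewords rather than one of the paper's datasets.

```python
import itertools
import numpy as np

def code_length_from_design(D):
    """Code length suggested by the rank criterion, l = rank(D)."""
    D = np.asarray(D, dtype=float)
    return int(np.linalg.matrix_rank(D, hermitian=True))

# Hypothetical design matrix for k = 8 classes: the Gramian of all
# 8 sign patterns of length 3, so D has 8 rows but rank 3.
X = np.array(list(itertools.product([-1, 1], repeat=3)), dtype=float)
D = X @ X.T
l = code_length_from_design(D)
print(D.shape, l)
```

In this toy case the three columns of X are mutually orthogonal, so rank(D) = 3 and the suggested code length coincides with log2(8).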

Fig. 4. D matrix for the Traffic (a) and ARFace (b) datasets. 
XX T term obtained via ECF for Traffic (c) and ARFace (d) 
datasets. ECOC coding matrix X obtained with ECF for Traffic 
(e) and ARFace (f). 


4.3 Order of Coordinate Updates 

Coordinate Descent has been applied to a wide span of problems
with satisfying results. However, the problem of
choosing the coordinate to minimize at each iteration
remains open [47], ED, ED, ED. In particular, [44] derives a
convergence rate which is faster when coordinates are chosen
uniformly at random rather than in a cyclic fashion. Hence,
choosing coordinates at random is a suitable choice when the
problem shows some of the following characteristics [47]:

• Not all data is available at all times.

• A randomized strategy is able to avoid a worst-case order
of coordinates, and hence might be preferable.

• Recent efforts suggest that randomization can improve the
convergence rate [44].

However, the structure of ECF is different and calls for
a different analysis. In particular, we remark the following
points: (i) At each coordinate update of ECF, information about
the rest of the coordinates is available. (ii) Since our coordinate
updates are solved uniquely, repeating a coordinate
update does not change the objective function. (iii) The descent
in the objective value when updating a coordinate is maximal
when all other coordinates have been updated. These reasons
lead us to choose a cyclic update scheme for ECF. In addition,
in Figure 5 we show two examples in which the cyclic
order of coordinates converges faster than the random order for
two problems: Vowel and ARFace (refer to Section 5 for further
information on the datasets). This behavior is common to all
datasets. In particular, note how the cyclic order of coordinates
reduces the standard deviation of the objective function, which
is denoted by the narrower blue shaded area in Figure 5.
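
The difference between the two update orders can be explored on a toy problem. The sketch below is not the ECF coordinate update: as an assumption for illustration, it uses exact coordinate minimization of a convex quadratic f(x) = 0.5 x^T A x − b^T x, which shares property (ii) above (repeating an exactly solved coordinate update leaves the objective unchanged). All names are hypothetical.

```python
import numpy as np

def objective(A, b, x):
    return 0.5 * x @ A @ x - b @ x

def update_coordinate(A, b, x, i):
    """Exactly minimize the quadratic over coordinate i, others fixed."""
    rest = b[i] - A[i] @ x + A[i, i] * x[i]
    x[i] = rest / A[i, i]

def coordinate_descent(A, b, order):
    x = np.zeros(len(b))
    history = [objective(A, b, x)]
    for i in order:
        update_coordinate(A, b, x, i)
        history.append(objective(A, b, x))
    return x, history

rng = np.random.default_rng(0)
n, sweeps = 20, 5
M = rng.normal(size=(n, n))
A = M @ M.T + n * np.eye(n)      # symmetric positive definite toy problem
b = rng.normal(size=n)

cyclic_order = np.tile(np.arange(n), sweeps)         # 0, 1, ..., n-1, 0, 1, ...
random_order = rng.integers(0, n, size=n * sweeps)   # uniform at random
_, cyclic_hist = coordinate_descent(A, b, cyclic_order)
_, random_hist = coordinate_descent(A, b, random_order)
print(cyclic_hist[-1], random_hist[-1])
```

With a random order some coordinates may be drawn twice in a row, and those repeated updates spend an iteration without decreasing the objective, which is one intuition behind preferring the cyclic scheme when every update is solved exactly.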



Fig. 5. Mean Frobenius norm value with standard deviation as
a function of the number of coordinate updates over 50 different
trials. The blue shaded area corresponds to cyclic updates while
the red area denotes random coordinate updates for the Vowel (a)
and ARFace (b) datasets.


4.4 Approximation Errors and Convergence results when 
D is an inner product of binary data 

The optimization problem posed by ECF is non-convex
due to the quadratic term XX^T, even if the discrete
constraint is relaxed. This implies that we cannot guarantee
that the algorithm converges to the global optimum. Recall that
ECF seeks the term XX^T that is closest to D under the
Frobenius norm. Hence, the error in the approximation can be
measured by ||X* X*^T − D||_F^2, where X* is the local optimal
point to which ECF converges. In this sense, we introduce D_B,
which is the matrix of inner products of discrete vectors that
is closest to D under the Frobenius norm. Thus, we expand
the norm as in the following equation:

    \|X^* X^{*T} - D\|_F^2 = \|X^* X^{*T} - D_B + D_B - D\|_F^2    (17)
        = \|X^* X^{*T} - D_B\|_F^2 + \|D - D_B\|_F^2    (18)
        - 2\,\mathrm{tr}\big((X^* X^{*T} - D_B)(D - D_B)\big).    (19)

The first two terms of this expansion define two error components:

• The optimization error ε_o: measured as the distance between
the local optimum to which ECF converges and D_B,
denoted by ε_o = ||X* X*^T − D_B||_F^2, which is expressed as
the first term in Equation (18).

• The discretization error ε_d: computed as ε_d = ||D − D_B||_F^2,
that is, the distance between D and the closest inner
product of discrete vectors D_B, expressed as the second
term in Equation (18).
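
The expansion of the approximation error into an optimization term, a discretization term and a trace cross term can be checked numerically. The sketch below is an illustration under assumptions: `X_star` and `D_B` are random stand-ins (the true closest discrete Gramian is not computed), which is sufficient because the identity holds for any symmetric matrices.

```python
import numpy as np

def sq_frobenius(M):
    """Squared Frobenius norm of a matrix."""
    return float(np.sum(M * M))

rng = np.random.default_rng(0)
k, l = 6, 4
X_star = rng.choice([-1.0, 1.0], size=(k, l))   # stand-in for the ECF solution
B = rng.choice([-1.0, 1.0], size=(k, l))
D_B = B @ B.T                                    # stand-in discrete Gramian
noise = rng.normal(scale=0.3, size=(k, k))
D = D_B + (noise + noise.T) / 2.0                # symmetric real-valued design

G = X_star @ X_star.T
total = sq_frobenius(G - D)                      # full approximation error
eps_o = sq_frobenius(G - D_B)                    # optimization error
eps_d = sq_frobenius(D - D_B)                    # discretization error
cross = -2.0 * np.trace((G - D_B) @ (D - D_B))   # trace cross term
print(total, eps_o + eps_d + cross)
```

The two printed values agree up to floating-point rounding, which confirms the signs of the three terms in the expansion.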

In order to better understand how ECF works we analyze
both components separately. To analyze whether ECF converges
to a good solution in terms of the Frobenius norm, we set ε_d = 0 by
generating a matrix D = D_B which is the inner product matrix
of random discrete vectors; thus, all the terms except
||X* X*^T − D_B||_F^2 are zero. By doing that, we can empirically
observe the magnitude of the optimization error ε_o. To this end,
we run ECF 30 times on 100 different D_B matrices
of different sizes and calculate the average ε_o. Figure 6 shows
examples for different D_B matrices of size 10 × 10, 100 × 100,
and 500 × 500. In Figure 6 we can see how ECF converges to
a solution with almost negligible optimization error after 15
iterations. In fact, the average objective value over all 3000 runs
of ECF on different D_B's after 15 update cycles (coordinate
updates for all x_i's) is \bar{ε}_o < 10^{-10}. This implies that ECF
converges on average to a point with almost negligible objective
value, and that when it is applied to D's which are not computed from
binary components the main source of the approximation error
is the discretization error ε_d. Since ECF seeks a discrete
decomposition of D, this discretization error is unavoidable.

Fig. 6. Mean objective value and standard deviation (y axis) as a
function of the number of update cycles (x axis) for 30 runs
of ECF on a random D_B of 10 classes (a), 100 classes (b), and
500 classes (c). (d) Toy problem synthetic data, where each color
corresponds to a different category in the multi-class problem.
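
The synthetic setting used in this analysis, in which D is itself an inner product of discrete vectors, can be sketched as follows. The code only builds the test matrices and does not run ECF. The helper names and the code length l = ⌈log2(k)⌉ are assumptions made for illustration, since the length used in these runs is not stated here.

```python
import numpy as np

def random_discrete_gram(k, l, rng):
    """Draw k random codewords in {-1, +1}^l and return (X, X X^T)."""
    X = rng.choice([-1, 1], size=(k, l))
    return X, X @ X.T

def frobenius_objective(X, D):
    """Squared Frobenius distance between the Gramian of X and D."""
    diff = X @ X.T - D
    return float(np.sum(diff * diff))

rng = np.random.default_rng(0)
for k in (10, 100, 500):
    l = int(np.ceil(np.log2(k)))   # assumed code length, not from the paper
    X, D_B = random_discrete_gram(k, l, rng)
    # The generating codes attain objective 0, so the discretization error
    # vanishes and any residual left by a solver is optimization error.
    print(k, l, frobenius_objective(X, D_B), np.linalg.matrix_rank(D_B) <= l)
```

Because a zero-error discrete factor exists by construction, the objective value reached by a solver on such a matrix measures the optimization error alone.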

5 Experiments 

In this section we present the experimental results of the 
proposed Error-Correcting Factorization method. In order to 
do so, we first present the data, methods and settings. 

5.1 Data 

The proposed Error-Correcting Factorization method was applied
to a total of 8 datasets. In order to provide a deep analysis
and understanding of the method, we synthetically generated
a Toy problem consisting of k = 14 classes, where each class
contained 100 two-dimensional points sampled from a Gaussian
distribution with the same standard deviation but different
means. Figure 6(d) shows the synthetic multi-class
data, where each color corresponds to a different category.
We selected 5 well-known UCI datasets: Glass, Segmentation,
Ecoli, Yeast and Vowel, which range in complexity and number
of classes. Finally, we apply the classification methodology
to two challenging computer vision categorization problems.
First, we test the methods on a real traffic sign categorization
problem consisting of 36 traffic sign classes. Second, 50 classes


TABLE 1

Dataset characteristics.

                Glass  Segment.  Ecoli  Yeast  Vowel  Toy  Traffic  ARFace
#s (samples)    214    2310      336    1484   990    400  3481     1300
#f (features)   9      19        8      8      10     2    100      120
#c (classes)    7      7         8      10     11     14   36       50







Fig. 7. Visual examples for the ARFace and Traffic datasets. 


from the ARFace m dataset are classified using the present
methodology. These datasets are public upon request to the
authors. Table 1 shows the characteristics of the different
datasets.
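
A Toy problem of the kind described above (14 classes, 100 two-dimensional Gaussian points per class, equal standard deviation, different means) could be generated along these lines. The placement of the class means on a circle, the radius, the standard deviation and the function name are all assumptions, since the paper does not specify them.

```python
import numpy as np

def make_toy_problem(k=14, points_per_class=100, std=1.0, radius=10.0, seed=0):
    """Sample k isotropic 2-D Gaussian classes with a shared std.

    Hypothetical generator: the means lie on a circle (an assumption).
    """
    rng = np.random.default_rng(seed)
    angles = 2.0 * np.pi * np.arange(k) / k
    means = radius * np.column_stack([np.cos(angles), np.sin(angles)])
    samples = [rng.normal(loc=m, scale=std, size=(points_per_class, 2))
               for m in means]
    X = np.vstack(samples)
    y = np.repeat(np.arange(k), points_per_class)
    return X, y

X_toy, y_toy = make_toy_problem()
print(X_toy.shape, np.bincount(y_toy))
```

Reducing `radius` or increasing `std` makes neighbouring classes overlap more, which is a simple way to control how confusable the categories are.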

• Traffic sign categorization: We test ECF on a real traffic
sign categorization problem of 36 classes [10]. The dataset
contains a total of 3481 samples of size 32×32, filtered using the
Weickert anisotropic filter, masked to exclude the background
pixels, and equalized to prevent the effects of illumination
changes. These feature vectors are then projected onto a
100-dimensional feature vector by means of PCA. A visual sample is shown in
Figure 7(a).

• ARFace classification: The ARFace database ED is composed
of 26 face images from each of 126 different subjects (of
which 50 are selected), portraying different expressions and
complements. An example is shown in Figure 7(b).

5.2 Methods and settings 

We compared the proposed Error-Correcting Factorization
method with the standard predefined One vs. All (OVA) and
One vs. One (OVO) approaches [48], [52]. In addition, we
introduce two random designs for ECOC matrices. In the first
one, we generated random ECOC coding matrices fixing the
general correction capability to a certain value (RAND). In
the second, we generate a Dense Random coding matrix 0
(DENSE). These comparisons enable us to analyze the effect
of reorganizing the inter-class correcting capabilities of an
ECOC matrix. In order to compare our proposal with
state-of-the-art methods, we also used the Spectral ECOC
(S-ECOC) method [57] and the Relaxed Hierarchy (R-H) [23].
Finally, we propose two different flavors of ECF, ECF-H and
ECF-E. In ECF-H we compute the design matrix D in order
to allocate the correction capabilities to those classes that
are hard to discriminate. On the other hand, for ECF-E we
compute D allocating correction to those classes that are easy
to discriminate. D is computed as the Mahalanobis distance
between each pair of classes. Although there exist a number of
approaches to define D from data [23], [58], [57], e.g., the margin
between each pair of classes (after training a One vs. One SVM
classifier), we experimentally observed that the Mahalanobis
distance provides good generalization and avoids the
computational cost of training a One vs. One SVM classifier. All the
reported classification accuracies are the mean of a stratified
5-fold cross-validation on the aforementioned datasets. For
all methods we used an SVM classifier with an RBF kernel. The
parameters C and γ were tuned by cross-validation on a
validation subset of the data using an inner 2-fold
cross-validation. The parameter C was tuned by grid search on
a log sampling in the range [0, 10^10], and the γ parameter
was equivalently tuned on an equidistant linear sampling in
the range [0, 1]. We used the libsvm implementation available
at [12]. For both ECF-H and ECF-E we run the factorization
forcing different minimum distances between classes by setting
P ∈ \mathbf{1} · {1, 3, 5, 7, 10}. For the Relaxed Hierarchy method [23]
we used values ρ ∈ {0.1, 0.4, 0.7, 0.9}. In all the compared
methods that use a decoding function (e.g., all tested methods
but the one in ESI) we used both the Hamming Decoding (HD)
and the Loss-Weighted Decoding (LWD) 06).
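
As an illustration of the data-driven design step, the pairwise Mahalanobis distance between classes could be computed as in the sketch below. This is one possible reading, not the authors' implementation: it measures the distance between class means under a pooled within-class covariance, which is an assumption, and it stops short of mapping distances to the entries of D, which is not detailed in this section. The function name is hypothetical.

```python
import numpy as np

def pairwise_mahalanobis(X, y):
    """Mahalanobis distance between every pair of class means.

    Assumption: a single pooled within-class covariance is used.
    """
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Center every sample on its own class mean and pool the covariance.
    centered = X - means[np.searchsorted(classes, y)]
    cov = centered.T @ centered / (len(X) - len(classes))
    cov_inv = np.linalg.pinv(cov)
    diffs = means[:, None, :] - means[None, :, :]
    squared = np.einsum('ijk,kl,ijl->ij', diffs, cov_inv, diffs)
    return np.sqrt(np.clip(squared, 0.0, None))

# Usage on three synthetic classes in 2-D.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 6.0]])
y_demo = np.repeat(np.arange(3), 40)
X_demo = centers[y_demo] + rng.normal(size=(120, 2))
M = pairwise_mahalanobis(X_demo, y_demo)
print(np.round(M, 2))
```

Small entries of the resulting matrix flag class pairs that are hard to discriminate (the focus of ECF-H), while large entries flag easily separable pairs (the focus of ECF-E).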

5.3 Experimental Results 

In Figure 8 we show the multi-class classification accuracy
as a function of the relative computational complexity for
all datasets using both Hamming Decoding (HD) and
Loss-Weighted Decoding (LWD). We used non-linear SVM classifiers
and we define the relative computational complexity
as the number of unique Support Vectors (SVs) yielded by
each method, as in [23]. For visualization purposes we use
an exponential scale and normalize the number of SVs by
the maximum number of SVs obtained by a method on that
particular dataset. In addition, although the code length cannot
be considered an accurate measure of complexity when
using non-linear classifiers in the feature space, it is the only
measure of complexity that is available prior to learning the
binary problems and designing the coding matrix. In this
sense, we show in Figure 9 the classification results for all
datasets as a function of the code length l, using both Hamming
Decoding (HD) and Loss-Weighted Decoding (LWD). Figures 8
and 9 show how the proposed ECF-H obtains in most
cases better performance than state-of-the-art approaches,
even with reduced computational complexity. In addition, in
most datasets ECF-H is able to boost the boundaries of
those classes prone to error; as a result, it attains
higher classification accuracies than the rest of the methods at
the price of a small increase in the relative computational
complexity. Specifically, we can see how on the Glass,
Vowel, Yeast, Segmentation and Traffic datasets (Figs. 8(e)-(f)
and 9(e)-(f), respectively), the proposed method outperforms
the rest of the approaches while yielding a comparable or
even lower computational complexity, independently of the
decoding function used. We can also see that the RAND and
ECF-E methods present erratic behaviours. This is expected for
the random coding design, since increasing the number of
SVs or dichotomies does not imply an increase in performance
if the dichotomies are not carefully selected. On the other
hand, the reason why ECF-E is not stable is not completely
straightforward. ECF-E focuses its design on dichotomies that are
very easy to learn, allocating correction to those classes that
are separable. We hypothesize that when these dichotomies
become harder to learn (there exists a limited number of easily
separable partitions), the addition of a difficult dichotomy harms
the performance by adding confusion to previously learned
dichotomies until proper error-correction is allocated. In
contrast, we can see how ECF-H usually shows a more
stable behaviour since it focuses on categories that are prone
to be confused. In this sense, we expect that the addition of
dichotomies will increase the correction. Finally, it is worth
noting that the Spectral ECOC method yields a code length
of l = k − 1, corresponding to the full eigendecomposition.


TABLE 2

Percentage of wins over all datasets for each method, using as
a complexity measure the number of SVs and the number of
classifiers. The last row shows the average complexity of each
method over all datasets. Abbreviations: ECF-H (H), ECF-E
(E), OVA (A), OVO (O), DENSE (D), RAND (R), S-ECOC (S).

Method          R-H*   S      H      E      D      R      A      O
Win % SVs       0.0    22.5   62.1   10.3   50.0   5.7    14.2   25.0
Win % nclass.   0.0    48.5   70.0   17.5   25.0   6.9    12.5   16.6
Avg. Comp.      0.58   0.87   0.88   0.89   0.91   0.92   0.99   0.99

Our proposal defines coding matrices that are guaranteed to follow
the design denoted by D while fulfilling ECOC properties.
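
For reference, Hamming decoding, one of the two decoding functions used in these experiments, can be sketched as follows. The 4-class code matrix is a hypothetical example chosen so that every pair of codewords differs in at least 3 positions. It is not one of the matrices produced by ECF.

```python
import numpy as np

def hamming_decode(code_matrix, predictions):
    """Assign each sample to the class whose codeword differs in fewest bits."""
    codes = np.asarray(code_matrix)    # shape (k, l), entries in {-1, +1}
    preds = np.asarray(predictions)    # shape (n, l), entries in {-1, +1}
    distances = (preds[:, None, :] != codes[None, :, :]).sum(axis=2)
    return distances.argmin(axis=1)

# Hypothetical code matrix with minimum pairwise Hamming distance 3.
codes = np.array([
    [+1, +1, +1, +1, +1],
    [+1, -1, -1, +1, -1],
    [-1, +1, -1, -1, +1],
    [-1, -1, +1, -1, -1],
])

# One binary classifier errs on a sample of class 2 (first bit flipped).
noisy = codes[2].copy()
noisy[0] *= -1
print(hamming_decode(codes, noisy[None, :]))
```

With a minimum distance of 3, any single dichotomy error still leaves the true codeword as the unique nearest one, which is the error-correcting behaviour the coding design relies on.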

As a summary, we show in Figure 10 a comparison in
terms of classification accuracy for different methods over all
datasets. We compare the classification accuracy of a selected
method for both decodings (at different operating complexities
if available) versus the best performing method in a range of
±5% of the operating complexity. For consistency we show the
comparison using both the number of SVs and the number of
dichotomies as the computational complexity. If the compared
method dominates in most of the datasets it will be found
above the diagonal. In Figures 10(a) and 10(d) we compare
ECF-H with the best performing of the rest of the methods and
see that ECF-H outperforms the rest of the methods 62% to 70%
of the time, depending on the complexity measure. This
implies that ECF-H dominates most of the methods in terms of
performance by focusing on those classes that are more prone
to error, regardless of the complexity measure used (number of
SVs or number of dichotomies). In addition, when repeating
the comparison for ECF-E in Figures 10(b) and 10(e) we see
that the majority of the datasets are clearly below the diagonal
(ECF-E is the most suitable choice 10% to 17% of the time). Finally,
Figures 10(c) and 10(f) show the comparison for OVA, which
is a standard method often defended for its simplicity 08). We
clearly see how it never outperforms any method and it is
not the recommended choice for almost any dataset. In Table 2
we show the percentage of wins for all methods^6 in increasing
order of complexity averaged over all datasets. Note how
ECF-H (denoted by H in the table), although it is the third least
complex method, outperforms by far the rest of the methods,
with an improvement of at least 12% to 20% in the worst
case. In conclusion, the experimental results show that ECF-H
yields ECOC coding matrices which obtain comparable or even
better results than state-of-the-art methods with similar relative
complexity. Furthermore, by allowing a small increase in
the computational complexity when compared to state-of-the-art
methods, ECF is able to obtain better classification results
by boosting the boundaries of classes that are prone to be
confused.


6 Conclusions 

We presented the Error-Correcting Factorization method for 
multi-class learning which is based on the Error-Correcting 
Output Codes framework. The proposed method factorizes a 
design matrix of desired correction properties into a discrete 
Error-Correcting component consistent with the design matrix. 
ECF is a general method for building an ECOC multi-class 
classifier with desired properties, which can be either directly 

6. The R-H method [23] is far less complex than the compared
methods; however, we compare it to the closest operating
complexity for each of the rest of the methods.
Fig. 8. Multi-class classification accuracy (y axis) as a function of the relative computational complexity (x axis) for all datasets and
both decoding measures. Panels include (a) Toy dataset HD, (b) Toy dataset LWD, (c) Ecoli dataset HD and (d) Ecoli dataset LWD.
Legend: ECF-H, ECF-E, S-ECOC, R-H, RAND, DENSE, OVO, OVA.
Fig. 9. Multi-class classification accuracy (y axis) as a function of the number of dichotomies (x axis) for all datasets and both decoding
measures. Legend: ECF-H, ECF-E, S-ECOC, R-H, RAND, DENSE, OVO, OVA.
Fig. 10. Summary of performance over all datasets using the number of SVs and the number of dichotomies
as the measure of complexity, respectively, for ECF-H (a)(d), ECF-E (b)(e) and OVA (c)(f). Axis label: best classification accuracy.


set by the user or obtained from data using a priori inter-class
distances. We note that the proposed approach is not a replacement
for ECOC codings, but a generalized framework to build
ECOC matrices that follow a certain error-correcting design
criterion. The Error-Correcting Factorization is formulated as a
minimization problem which is optimized using a constrained
Coordinate Descent, where the minimizer of each coordinate
is the solution to a least-squares problem with box and linear
constraints that can be efficiently solved. By analyzing the
approximation error, we empirically show that although ECF
is a non-convex optimization problem, the optimization is
very efficient. We performed experiments using ECF to build
ECOC matrices following the common trend in state-of-the-art
works, in which the design matrix prioritizes the most
separable classes. In addition, we hypothesized and showed
that a more beneficial strategy is to allocate the correction
capability of the ECOC to those categories which are more
prone to confusion. Experiments show that when ECF is used
to allocate the correction capabilities to those classes which are
prone to confusion we obtain higher accuracies than
state-of-the-art methods, with efficient models in terms of the number
of Support Vectors and dichotomies.

Finally, there still exist open questions that require a deeper
analysis in future work. The results obtained raise a fair doubt
regarding the right allocation of error-correcting power in
several methods found in the literature, where ECOC designs are
based on the premise of boosting the classes which are easily
separable. In the light of these results, we may conjecture that
a careful allocation of error correction must be made in a
way that balances two aspects: on the one hand, simple-to-classify
boundaries must be handled properly; on the other hand,
error correction must be allocated to difficult classes for the
ensemble to correct possible mistakes. In addition, it would be
interesting to study which parameters affect the
suitability of the "no class is left behind" criterion versus the
"hard classes are left behind" one. Finally, we could consider
ternary matrices and further regularizations.